Slides

The slide deck includes the examples in this R notebook as well as links to cheatsheets for each tidyverse package and additional reading

About {palmerpenguins}

install.packages(c(
  "devtools",
  "tidyverse",
  "palmerpenguins"))

devtools::install_github("hadley/emo")

📦 Loading packages

# loading packages
library(tidyverse)
library(palmerpenguins)
library(emo)

# viewing data sets in package "palmerpenguins"
data(package = "palmerpenguins")

readr

Let’s get data into R!

# option 1: load using URL ----
raw_adelie_url <- read_csv("https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-pal.219.3&entityid=002f3893385f710df69eeebe893144ff")

# option 2: load using filepath ----
raw_adelie_filepath <- read_csv("raw_adelie.csv")

Lucky for us, Allison Horst compiled data from all three species together for us in the {palmerpenguins} package!

  • penguins contains a clean dataset, and
  • penguins_raw contains raw data
# saves package tibble into global environment
penguins <- palmerpenguins::penguins 
head(penguins)
penguins_raw <- palmerpenguins::penguins_raw
head(penguins_raw)

tibble

A tibble is much like the data frame in base R, but optimized for use in the Tidyverse. Let’s take a look at the differences.

# try each of these commands in the console and see if you can spot the differences!

as_tibble(penguins)
as.data.frame(penguins)

What differences do you see?

You might see a tibble prints:

  • variable classes
  • only 10 rows
  • only as many columns as can fit on the screen
  • NAs are highlighted in console so they’re easy to spot (font highlighting and styling in tibble)

Not so much a concern in an R Markdown file, but noticeable in the console. Print method makes it easier to work with large datasets.

There are a couple of other main differences, namely in subsetting and recycling. Check them out in the `vignette(“tibble”)

Try it out here!

vignette("tibble")

Taking a closer look at penguins

Get a full view of the dataset:

View(penguins)

Or catch a glimpse:

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,…
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,…
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
## $ sex               <fct> male, female, female, NA, female, male, female, mal…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…

ggplot2

Let’s start by making a simple plot of our data!

ggplot2 uses the “Grammar of Graphics” and layers graphical components together to create a plot.

Let’s see if body mass varies by penguin sex

penguins %>%
  ggplot()

penguins %>%
  ggplot(aes(x = sex, y = body_mass_g))

penguins %>%
  ggplot(aes(x = sex, y = body_mass_g)) +
  geom_point()

# A scatter plot doesn't really tell us much.
# Let's try a different geometry

penguins %>%
  ggplot(aes(x = sex, y = body_mass_g)) +
  geom_boxplot()

# That's more informative!
# Let's see if there are differences by penguin species

penguins %>%
  ggplot(aes(x = sex, y = body_mass_g)) +
  geom_boxplot(aes(fill = species))

# What do you notice?

What observations can you make from the plot?

You might see:

  • Gentoo penguins have higher body mass than Adelie and Chinstrap penguins
  • Higher body mass among male Gentoo penguins compared to female penguins
  • Pattern not as discernable when comparing Adelie and Chinstrap penguins
  • No NAs among Chinstrap penguin data points! sex was available for each observation

I wonder what percentage of observations are NA for each species? Let’s get the tidyverse to help us with this!

Next stop, dplyr!

dplyr

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ade…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgers…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1,…
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1,…
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 18…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475,…
## $ sex               <fct> male, female, female, NA, female, male, female, mal…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 200…

select()

Selecting dataset columns with select()

penguins %>%
  select(species, sex, body_mass_g)

arrange()

Reordering the data set with arrange()

penguins %>%
  select(species, sex, body_mass_g) %>%
  arrange(desc(body_mass_g))

group_by() and summarize()

Summarizing the data using group_by() and summarize()

We can use group_by() to group our data by species and sex, and summarize() to calculate the average body_mass_g for each grouping.

penguins %>%
  select(species, sex, body_mass_g) %>%
  group_by(species, sex) %>%         
  summarize(mean = mean(body_mass_g))

count() and add_count()

If we’re just interested in counting the observations in each grouping, we can group and summarize with special functions count() and add_count().

Counting can be done with group_by() and summarize(), but it’s a little cumbersome.

It involves… 1. using mutate() to create an intermediate variable n_species that adds up all observations per species, and 2. an ungroup()-ing step

penguins %>% 
  group_by(species) %>%
  mutate(n_species = n()) %>%            
  ungroup() %>%                          
  group_by(species, sex, n_species) %>%
  summarize(n = n())

In contrast, count() and add_count() offer a simplified approach.

Thank you to Alison Hill for this suggestion!

penguins %>% 
  count(species, sex) %>%
  add_count(species, wt = n,    
            name = "n_species") 

mutate()

We can add to our counting example by using mutate() to create a new variable prop, which represents the proportion of penguins of each sex, grouped by species

Thank you to Alison Hill for this suggestions!

penguins %>% 
  count(species, sex) %>%
  add_count(species, wt = n, 
            name = "n_species") %>%
  mutate(prop = n/n_species*100) 

filter()

Finally, we can filter rows to only show us Chinstrap penguin summaries by adding filter() to our pipeline

penguins %>% 
  count(species, sex) %>%
  add_count(species, wt = n, 
            name = "n_species") %>%
  mutate(prop = n/n_species*100) %>%
  filter(species == "Chinstrap")

forcats

Currently the year variable in penguins is continuous from 2007 to 2009.

There may be situations where this isn’t what we want and we might want to turn it into a categorical variable instead.

The factor() function is perfect for this.

penguins %>%
  mutate(year_factor = factor(year, levels = unique(year)))

The result is a new factor year_factor with levels 2007, 2008 and 2009!

penguins_new <-
  penguins %>%
  mutate(year_factor = factor(year, levels = unique(year)))
penguins_new

Double check the variable class and factor levels below:

class(penguins_new$year_factor)
## [1] "factor"
levels(penguins_new$year_factor)
## [1] "2007" "2008" "2009"

stringr

Let’s play around with strings a little bit!

From what we’ve learned so far, take a guess at what this code chunk will do before running it.

penguins %>%
  select(species, island) %>%
  mutate(ISLAND = str_to_upper(island))

How about this one? How is it different from the previous code chunk?

penguins %>%
  select(species, island) %>%
  mutate(ISLAND = str_to_upper(island)) %>%
  mutate(species_island = str_c(species, ISLAND, sep = "_"))

tidyr

Both penguin datasets are already tidy!

We can pretend that it wasn’t and that body_mass_g was recorded separately for male, female, and sex NA penguins. Like untidy_penguins below:

untidy_penguins <-
  penguins %>%
    pivot_wider(names_from = sex,
                values_from = body_mass_g)
untidy_penguins

Now let’s make it tidy again with the help of the pivot_longer() function! pivot_wider()is another very popular tidying function. Have you seen it before? Hint: see the code chunk above!

untidy_penguins %>%
  pivot_longer(cols = male:`NA`, 
               names_to = "sex",
               values_to = "body_mass_g")

purrr

Ok, we love our earlier boxplot showing us body_mass_g by sex and colored by species… but let’s change up the colors to keep with our Antarctica theme!

I’m a big fan of the color palettes in the nord 📦

Let’s turn this plot:

penguins %>%
  ggplot(aes(x = sex, y = body_mass_g)) +
  geom_boxplot(aes(fill = species))

Into this one!

penguins %>%
  ggplot(aes(x = sex, y = body_mass_g)) +
  geom_boxplot(aes(fill = species)) +
  scale_fill_manual(values = nord::nord_palettes$frost)

Let’s try out the frost palette.

# we'll need to load the {nord} package
library(nord)

# you can choose colors using the color hex codes
nord::nord_palettes$frost
## [1] "#8FBCBB" "#88C0D0" "#81A1C1" "#5E81AC"
# but you might prefer to use `scale_fill_manual()` 
# or more specialized functions like `scale_fill_nord()` 
# included in the {nord} package
penguins %>%
  ggplot(aes(x = sex, y = body_mass_g)) +
  geom_boxplot(aes(fill = species)) +
  scale_fill_manual(values = nord::nord_palettes$frost)

  #scale_fill_nord(palette = "frost")

Ok now for a handy package/function trio!

# we'll have to load the {prismatic} package
library(prismatic)

prismatic::color(nord::nord_palettes$frost)
## <colors>
## #8FBCBBFF #88C0D0FF #81A1C1FF #5E81ACFF

purrr’s map() function can help us iterate the prismatic::color() function over all palettes in a palette package like nord!

Note: Not all colors will show well, like polarnight below. prismatic::color() relies on a package that kinda has limited functionality in this sense (crayon). It’s doing its best :)

nord::nord_palettes %>% map(prismatic::color)

Bonus material!

Recreating a {palmerpenguins} plot

Let’s practice in real time!

# scatterplot sequence ----
penguins %>%
  ggplot() + 
  geom_point(aes(x = flipper_length_mm, y = bill_length_mm)) # add aesthetics

penguins %>%
  ggplot() +
  geom_point(aes(x = flipper_length_mm, y = bill_length_mm, 
                 color = species)) # add color per species

penguins %>%
  ggplot() +
  geom_point(aes(x = flipper_length_mm, y = bill_length_mm, 
                 color = species, shape = species)) # add shape per species

penguins %>%
  ggplot() +
  geom_point(aes(x = flipper_length_mm, y = bill_length_mm, 
                 color = species, shape = species)) # add shape per species

penguins %>%
  ggplot() +
  geom_point(aes(x = flipper_length_mm, y = bill_length_mm, 
                 color = species, shape = species)) +
  geom_smooth(aes(x = flipper_length_mm, y = bill_length_mm, 
                  color = species))

penguins %>%
  ggplot(aes(x = flipper_length_mm, y = bill_length_mm)) + 
  geom_point(aes(color = species, shape = species)) +
  geom_smooth(aes(color = species), se = FALSE, method = "lm")

penguins %>%
  ggplot() +
  geom_point(aes(x = flipper_length_mm, y = body_mass_g, 
                 color = species, shape = species))

penguins %>%
  ggplot() +
  geom_histogram(aes(x = flipper_length_mm))

penguins %>%
  ggplot() +
  geom_histogram(aes(x = flipper_length_mm, color = species))

penguins %>%
  ggplot() +
  geom_histogram(aes(x = flipper_length_mm, fill = species))

penguins %>%
  ggplot() +
  geom_histogram(aes(x = flipper_length_mm, fill = species, 
                     position = "identity", alpha = 0.5))

TidyTuesday

tidytuesdayR

# install.packages("tidytuesdayR")
# remotes::install_github("thebioengineer/tidytuesdayR")

library(tidytuesdayR)

# load the data
tt_data <- tt_load("2020-07-27") # error message
tt_data <- tt_load("2020-07-28")
tt_data <- tt_load(2020, week=31)

# take a peek
readme(tt_data)
print(tt_data)

Examples